Using Correlation Dimension for Analysing Text Data

Authors

  • Ilkka Kivimäki
  • Krista Lagus
  • Ilari T. Nieminen
  • Jaakko J. Väyrynen
  • Timo Honkela
Abstract

In this article, we study the scale-dependent dimensionality properties and overall structure of text data with a method that measures correlation dimension at different scales. As experimental results, we present an analysis of text data sets drawn from the Reuters and Europarl corpora, which are also compared to artificially generated point sets. A comparison is also made with speech data. The results reflect some of the typical properties of the data, and we discuss how our method could be used to improve various data analysis applications.
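As an illustration of what measuring correlation dimension at different scales can mean in practice, the following is a minimal sketch based on the standard Grassberger-Procaccia correlation sum. It is not taken from the article and may differ from the authors' procedure. The function names `correlation_sum` and `local_dimension` are hypothetical, and the input is assumed to be a small set of points given as rows of a NumPy array.

```python
import numpy as np

def correlation_sum(points, radii):
    """Fraction of distinct point pairs whose distance is below each radius.

    Assumption: this is the textbook Grassberger-Procaccia correlation sum
    C(r), not necessarily the estimator used in the article.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Squared pairwise Euclidean distances via the Gram-matrix identity.
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    # Keep each unordered pair once (upper triangle, no diagonal).
    iu = np.triu_indices(n, k=1)
    dists = np.sqrt(np.maximum(d2[iu], 0.0))
    dists.sort()
    # Number of pair distances strictly below each radius.
    counts = np.searchsorted(dists, radii, side="left")
    return counts / dists.size

def local_dimension(radii, c):
    """Slope of log C(r) against log r between consecutive radii.

    Each slope is a scale-dependent dimension estimate for that interval.
    Radii must be large enough that every C(r) is non-zero.
    """
    return np.diff(np.log(c)) / np.diff(np.log(radii))

if __name__ == "__main__":
    # Synthetic example: points on a straight line embedded in 3-D space.
    rng = np.random.default_rng(0)
    t = rng.random(1000)
    line = np.outer(t, np.array([1.0, 2.0, 2.0]) / 3.0)
    radii = np.logspace(-2, -1, 5)
    print(local_dimension(radii, correlation_sum(line, radii)))
```

For points lying on a line, the slopes should stay close to 1 across the chosen radii. Data whose slope changes with the radius has a dimensionality that depends on scale. The full pairwise distance matrix limits this sketch to a few thousand points.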

Similar resources

Semi-parametric Quantile Regression for Analysing Continuous Longitudinal Responses

Quantile regression (QR) models have recently been applied often to longitudinal data analysis. When the distribution of responses appears skewed and asymmetric due to outliers and heavy tails, QR models may be suitable. In this paper, a semi-parametric quantile regression model is developed for analysing continuous longitudinal responses. The error term's distribution is assumed to be Asymmetr...

Genre classification for a corpus of academic webpages

In this paper we report our analysis of the similarities between webpages that are crawled from European academic websites, and comparison of their distribution in terms of the English language variety (native English vs English as a lingua franca) and their language family (based on the country’s official language). After building a corpus of university webpages, we selected a set of relevant ...

A non subjective approach to the GP algorithm for analysing noisy time series

We present an adaptation of the standard Grassberger-Procaccia (GP) algorithm for estimating the correlation dimension of a time series in a non-subjective manner. The validity and accuracy of this approach are tested using different types of time series, such as those from standard chaotic systems, pure white and colored noise, and chaotic systems with added noise. The effectiveness of the sche...

Bayesian paradigm for analysing count data in longitudinal studies using Poisson-generalized log-gamma model

In analyzing longitudinal data with count responses, a normal distribution is usually assumed for the random effects. However, in some applications the random effects may not be normally distributed. Misspecification of this distribution may reduce the efficiency of the estimators. In this paper, a generalized log-gamma distribution is used for the random effects which includes th...

Using Complex Argumentative Interactions to Reconstruct the Argumentative Structure of Large-Scale Debates

In this paper we consider the insights that can be gained by considering large scale argument networks and the complex interactions between their constituent propositions. We investigate metrics for analysing properties of these networks, illustrating these using a corpus of arguments taken from the 2016 US Presidential Debates. We present techniques for determining these features directly from...

Publication date: 2010